Take Home 3

Author

Guan Jhen LIN

Published

May 31, 2024

Modified

June 9, 2024

Credit: John Blank Photography

1. Introduction

FishEye International monitors business records of commercial fishing operators in Oceanus to identify and prevent illegal fishing. Analysts work with company data, including ownership, shareholders, transactions, and products/services, to build the CatchNet Knowledge Graph.

Last year, SouthSeafood Express Corp was caught illegally fishing, leading to its closure. FishEye wants to understand the temporal patterns and infer how the fishing market reacted to this event, as some businesses may have tried to capture SouthSeafood’s market share, while others may have become more cautious about illegal activities.

FishEye aims to develop visualization tools for CatchNet to identify influential people in business networks, considering the varied and changing shareholder and ownership relationships.

2. Mini-Challenge 3: Temporal Analysis

2.1 Tasks and Questions:

  • Develop an approach using visual analytics to highlight temporal patterns and changes in corporate structures, focusing on identifying the most active individuals and businesses.

  • Utilize visualizations to display typical and atypical business transactions, such as mergers and acquisitions, and infer the underlying motivations behind changes in their activity levels.

  • Create a visual approach to examine inferences about how a company’s influence evolves over time and whether ownership or influence over a network can be deduced.

  • Identify and visualize the network of SouthSeafood Express Corp and competing businesses, and how they changed due to SouthSeafood’s illegal fishing behavior.

  • Determine which companies benefited from SouthSeafood’s legal troubles and if there are other suspicious transactions related to illegal fishing, providing visual evidence to support your conclusions.

Data Source: VAST Challenge 2024: Mini-Challenge 3

3. Data

The MC3 dataset is a comprehensive collection comprising 60,520 nodes (entities) and 75,817 edges (relationships or connections) organized into 4,782 distinct components. The nodes in this dataset represent various types of entities, including individuals (Person), chief executive officers (CEO), companies, and other organizational structures.

On the other hand, the edges capture different types of relationships or interactions between these nodes. Some examples of edge types include shareholdership (ownership of shares in a company), beneficial ownership (enjoying the benefits of owning a property or asset without being the legal owner), and potentially other forms of associations or transactions.

The dataset is structured such that the nodes key contains a list representing all the node entities, with each node carrying attributes or properties that describe its characteristics, such as ID, type, country, revenue, founding date, and potentially other relevant details.

Correspondingly, the links key holds a list that represents all the edges or connections between these nodes. Each edge entry typically includes properties like the edge type, start and end dates (if applicable), and identifiers for the source and target nodes involved in the relationship.

With this comprehensive dataset capturing both node entities and their interconnections through various types of edges, researchers and analysts can conduct in-depth analyses to uncover patterns, understand the dynamics of relationships, and gain insights into the complex network of entities and their interactions.

3.1 Data Preparations

  • jsonlite: This package provides functionality for parsing and generating JSON data in R.
  • dplyr: A part of the tidyverse, dplyr provides a consistent set of verbs for data manipulation, making it easier to transform and summarize data frames.
  • tidyr: Another tidyverse package, tidyr helps in creating tidy data sets by providing functions for reshaping data frames.
  • stringr: This package provides a cohesive set of functions for string manipulation and regular expressions in R.
  • lubridate: lubridate is designed to make it easier to work with date-time data in R.
  • tidyverse: The tidyverse is a collection of R packages designed for data science, including dplyr, ggplot2, tidyr, and others.
  • readtext: readtext makes it easier to import and handle text data in R, particularly for text mining and analysis.
  • ggplot2: Part of the tidyverse, ggplot2 is a powerful data visualization package for creating complex and publication-ready plots.
  • visNetwork: visNetwork is a package for creating interactive network visualizations in R, using the vis.js library.
  • ggpubr: ggpubr provides easy-to-use functions for creating publication-ready plots and combining multiple ggplot2 plots into a single figure.
  • igraph: igraph is a collection of network analysis tools for creating and analyzing graphs and networks in R.
  • patchwork: patchwork is a package for composing multiple ggplot2 plots into a single figure, with easy layout control.
Click to show code
pacman::p_load(jsonlite,dplyr,tidyr,stringr,lubridate,tidyverse,readtext,ggplot2,visNetwork,stringr,ggpubr, igraph, patchwork)
Click to show code
mc3 <- fromJSON("data/mc3.json")

To effectively perform temporal analysis on such data, it can be beneficial to separate the data set into two distinct data frames: one for the node entities and another for the relationships or edges between them.

Click to show code
# Load necessary libraries
library(dplyr)
library(tidyr)
library(stringr)

# Assuming mc3 is already loaded as a list containing nodes dataframe
mc3_nodes <- as_tibble(mc3$nodes) %>%
  mutate(
    id = as.character(id),
    type = as.character(type),
    country = as.character(country),
    ProductServices = as.character(ProductServices),
    revenue_omu = as.numeric(revenue),
    head_of_org = as.character(HeadOfOrg),
    TradeDescription = as.character(TradeDescription),
    PointOfContact = as.character(PointOfContact),
    id = gsub("^[0-9]+\\.\\s*", "", id),  # Clean up IDs
    revenue_omu = ifelse(is.na(revenue_omu), 0, revenue_omu)
  ) %>%
  mutate(
    type = ifelse(type == "Entity.Person", "Entity.Person.Person", type)  # Rename Entity.Person
  ) %>%
  rename(
    last_edited_by = `_last_edited_by`,
    last_edited_date = `_last_edited_date`,
    date_added = `_date_added`,
    raw_source = `_raw_source`,
    algorithm = `_algorithm`
  ) %>%
  select(everything())

# Display the first few rows of the updated dataframe
print(head(mc3_nodes))
# A tibble: 6 × 17
  type    country ProductServices PointOfContact HeadOfOrg founding_date revenue
  <chr>   <chr>   <chr>           <chr>          <chr>     <chr>           <dbl>
1 Entity… Uziland Unknown         Rebecca Lewis  Émilie-S… 1954-04-24T0…   5995.
2 Entity… Mawala… Furniture and … Michael Lopez  Honoré L… 2009-06-12T0…  71767.
3 Entity… Uzifri… Food products   Steven Robert… Jules La… 2029-12-15T0…      0 
4 Entity… Islava… Unknown         Anthony Wyatt  Dr. Víct… 1972-02-16T0…      0 
5 Entity… Oceanus Unknown         Tracy Schmidt  Jacques … 1954-04-06T0…   4747.
6 Entity… Imazam  Fish, crustace… Corey Moore J… Thierry … 2031-09-30T0…  46567.
# ℹ 10 more variables: TradeDescription <chr>, last_edited_by <chr>,
#   last_edited_date <chr>, date_added <chr>, raw_source <chr>,
#   algorithm <chr>, id <chr>, dob <chr>, revenue_omu <dbl>, head_of_org <chr>
(mc3_nodes$type) %>% unique()
[1] "Entity.Organization.Company"         
[2] "Entity.Organization.LogisticsCompany"
[3] "Entity.Organization.FishingCompany"  
[4] "Entity.Organization.FinancialCompany"
[5] "Entity.Organization.NewsCompany"     
[6] "Entity.Organization.NGO"             
[7] "Entity.Person.Person"                
[8] "Entity.Person.CEO"                   
# Further processing to separate the 'type' column
mc3_nodes <- mc3_nodes %>%
  separate(type, into = c("type_1", "type_2", "type_3"), sep = "\\.", fill = "right", extra = "drop")

# Display the first few rows of the updated dataframe
print(head(mc3_nodes))
# A tibble: 6 × 19
  type_1 type_2       type_3  country   ProductServices PointOfContact HeadOfOrg
  <chr>  <chr>        <chr>   <chr>     <chr>           <chr>          <chr>    
1 Entity Organization Company Uziland   Unknown         Rebecca Lewis  Émilie-S…
2 Entity Organization Company Mawalara  Furniture and … Michael Lopez  Honoré L…
3 Entity Organization Company Uzifrica  Food products   Steven Robert… Jules La…
4 Entity Organization Company Islavara… Unknown         Anthony Wyatt  Dr. Víct…
5 Entity Organization Company Oceanus   Unknown         Tracy Schmidt  Jacques …
6 Entity Organization Company Imazam    Fish, crustace… Corey Moore J… Thierry …
# ℹ 12 more variables: founding_date <chr>, revenue <dbl>,
#   TradeDescription <chr>, last_edited_by <chr>, last_edited_date <chr>,
#   date_added <chr>, raw_source <chr>, algorithm <chr>, id <chr>, dob <chr>,
#   revenue_omu <dbl>, head_of_org <chr>
(mc3_nodes$type_2) %>% unique()
[1] "Organization" "Person"      
(mc3_nodes$type_3) %>% unique()
[1] "Company"          "LogisticsCompany" "FishingCompany"   "FinancialCompany"
[5] "NewsCompany"      "NGO"              "Person"           "CEO"             
Click to show code
mc3_edges <- as_tibble(mc3$links) %>%
  mutate(
    type = as.character(type),
    type_new = stringr::str_extract(type, "[^.]+$"),
    source = as.character(source),
    target = as.character(target),
    last_edited_by = as.character(`_last_edited_by`),
    last_edited_date = as.Date(`_last_edited_date`),
    date_added = as.Date(`_date_added`),
    raw_source = as.character(`_raw_source`),
    algorithm = as.character(`_algorithm`)
  ) %>%
  mutate(
    start_date = as.Date(start_date), 
    end_date = if_else(is.na(end_date), as.Date(NA), as.Date(end_date)),
    id = target
  ) %>%
  # Remove `_last_edited_by`, `_last_edited_date`, `_date_added`, `_raw_source`, `_algorithm` if they exist
  select(-c(`_last_edited_by`, `_last_edited_date`, `_date_added`, `_raw_source`, `_algorithm`)) %>%

  rename(
    last_edited_by = last_edited_by,
    last_edited_date = last_edited_date,
    date_added = date_added,
    raw_source = raw_source,
    algorithm = algorithm
  ) %>%

  select(everything())
Click to show code
  mc3_edges <- mc3_edges %>%
    mutate(
      start_date = as.POSIXct(start_date),
      last_edited_date = as.POSIXct(last_edited_date),
      date_added = as.POSIXct(date_added),
      end_date = if_else(is.na(end_date), as.POSIXct(NA), as.POSIXct(end_date)),
      id = target
    )

mc3_nodes <- mc3_nodes %>%
  mutate(
    last_edited_date = as.POSIXct(last_edited_date),
    date_added = as.POSIXct(date_added),
    founding_date = as.POSIXct(founding_date),
    dob = as.POSIXct(dob)
  )
Click to show code
mc3_edges [duplicated(mc3_edges ),]
# A tibble: 0 × 13
# ℹ 13 variables: start_date <dttm>, type <chr>, source <chr>, target <chr>,
#   key <int>, end_date <dttm>, type_new <chr>, last_edited_by <chr>,
#   last_edited_date <dttm>, date_added <dttm>, raw_source <chr>,
#   algorithm <chr>, id <chr>
mc3_nodes [duplicated(mc3_nodes ),]
# A tibble: 0 × 19
# ℹ 19 variables: type_1 <chr>, type_2 <chr>, type_3 <chr>, country <chr>,
#   ProductServices <chr>, PointOfContact <chr>, HeadOfOrg <chr>,
#   founding_date <dttm>, revenue <dbl>, TradeDescription <chr>,
#   last_edited_by <chr>, last_edited_date <dttm>, date_added <dttm>,
#   raw_source <chr>, algorithm <chr>, id <chr>, dob <dttm>, revenue_omu <dbl>,
#   head_of_org <chr>
Click to show code
nodes <- mc3_nodes %>%
  select(
    type_2, type_3, id, dob, country, 
    head_of_org, revenue, last_edited_by, 
    last_edited_date, date_added, raw_source, algorithm
  )

edges <- mc3_edges %>%
  select(
    type_new, id, start_date, end_date, 
    last_edited_by, last_edited_date, date_added, 
    raw_source, algorithm
  )

What are the Company Types?

# Count the unique values in type_2
type_2_unique_counts <- mc3_nodes %>%
  group_by(type_2) %>%
  summarise(count = n()) %>%
  arrange(desc(count))

# Display the summary table
print(type_2_unique_counts)
# A tibble: 2 × 2
  type_2       count
  <chr>        <int>
1 Person       51649
2 Organization  8871

Extracting the top 10 id to see the different company types in edges and nodes file.

Click to show code
top_10_ids <- edges %>%
  group_by(id) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  slice(1:10)

Create a barchart for nodes and edges.

Click to show code
barchart_type_counts_nodes <- mc3_nodes %>%
  count(type_3, sort = TRUE)

barchart_type_counts_edges <- mc3_edges %>%
  count(type_new, sort = TRUE)

b1 <- ggplot(barchart_type_counts_nodes, aes(x = reorder(type_3, -n), y = n)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  labs(
    title = "Distribution of Nodes Types",
    x = "Nodes Type",
    y = "Count"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 0, hjust = 0.8, size = 7)
  )

b2 <- ggplot(barchart_type_counts_edges, aes(x = reorder(type_new, -n), y = n)) +
  geom_bar(stat = "identity", fill = "lightblue") +
  labs(
    title = "Distribution of Edges Types",
    x = "Edge Type",
    y = "Count"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 0, hjust = 0.5, size = 7)
  )
combined_plot <- b1 / b2
print(combined_plot)

Top companies and the Types

Network Graph
Click to show code
library(dplyr)
library(igraph)
library(visNetwork)

# Assuming mc3_edges is already loaded as a dataframe

# Get the top 20 companies based on the count of edges
top_companies <- mc3_edges %>%
  count(id, sort = TRUE) %>%
  top_n(20, wt = n)

# Filter edges to include only those involving the top 20 companies
filtered_edges <- mc3_edges %>%
  filter(id %in% top_companies$id) %>%
  select(id, type_new)

# Prepare edges for graph creation
edges_for_graph <- filtered_edges %>%
  rename(from = id, to = type_new)

# Create a bipartite graph
bipartite_graph <- graph_from_data_frame(d = edges_for_graph, directed = FALSE)

# Assign types for bipartite mapping
V(bipartite_graph)$type <- bipartite_mapping(bipartite_graph)$type

# Prepare nodes for visualization
nodes_vis <- data.frame(
  id = V(bipartite_graph)$name, 
  label = V(bipartite_graph)$name, 
  group = ifelse(V(bipartite_graph)$type, "Type", "Company")
)

# Prepare edges for visualization
edges_vis <- igraph::as_data_frame(bipartite_graph, what = "edges")

# Create and customize the visNetwork graph
vis1 <- visNetwork(nodes_vis, edges_vis, width = "100%", height = "800px") %>%
        visNodes(shape = "dot", scaling = list(label = list(enabled = TRUE))) %>%
        visEdges(arrows = "none", color = list(color = "lightgray", highlight = "red")) %>%
        visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE) %>%
        visLegend() %>%
        visGroups(groupname = "Type", shape = "dot", color = "blue") %>%
        visGroups(groupname = "Company", shape = "dot", color = "green") %>%
        visLayout(randomSeed = 21) %>%
        visInteraction(navigationButtons = TRUE) %>%
        visPhysics(stabilization = FALSE, enabled = FALSE)
# Display the graph
vis1

Companies and amount of types

Click to show code
top_10_ids <- mc3_edges %>%
  group_by(id) %>%
  summarize(count = n()) %>%
  arrange(desc(count)) %>%
  slice(1:10)

filtered_edges <- mc3_edges %>%
  filter(id %in% top_10_ids$id)

type_new_counts <- filtered_edges %>%
  group_by(id, type_new) %>%
  summarise(count = n(), .groups = "drop") %>%
  arrange(desc(count))

type_new_colors <- c("Shareholdership" = "pink", 
                     "BeneficialOwnership" = "lightblue", 
                     "WorksFor" = "lightgreen")

wrap_text <- function(text, width) {
  wrapped_text <- str_replace_all(text, "-", " ")
  wrapped_text <- str_wrap(wrapped_text, width = width)
  wrapped_text <- str_replace_all(wrapped_text, " ", "-")
  return(wrapped_text)
}

type_new_counts <- type_new_counts %>%
  mutate(id_wrapped = wrap_text(id, 10)) 

type_new_counts <- type_new_counts %>%
  mutate(id_wrapped = factor(id_wrapped, levels = type_new_counts %>%
                               group_by(id_wrapped) %>%
                               summarize(total_count = sum(count)) %>%
                               arrange(desc(total_count)) %>%
                               pull(id_wrapped)))

b4 <- ggplot(type_new_counts, aes(x = id_wrapped, y = count, fill = type_new)) +
  geom_bar(stat = "identity", position = position_dodge(width = 0.9), width = 0.7) +
  geom_text(aes(label = count), position = position_dodge(width = 0.9), vjust = 0.8, size = 3) + # Add text labels
  labs(x = " ", y = " ", title = "Top Names with Multiple Type Relationships") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 0, hjust = 0.5, size = 8),  
    legend.position = "bottom"
  ) +
  scale_fill_manual(values = type_new_colors, name = "Type")
print(b4)

Which Companies has the highest Revenue?

Click to show code
library(plotly)
company_revenue <- mc3_nodes %>%
  group_by(id, type_3) %>%
  summarise(total_revenue = sum(revenue_omu, na.rm = TRUE), .groups = 'drop') 

plot <- plot_ly(data = company_revenue, 
                x = ~id,  
                y = ~total_revenue, 
                text = ~paste("Name:", id, 
                              "<br>Total Revenue:", total_revenue,
                              "<br>Type:", type_3), 
                type = "scatter",  
                mode = "markers",  
                marker = list(size = ~total_revenue / 1000000,  
                              sizemode = 'area', 
                              sizeref = 0.1, 
                              color = ~total_revenue, 
                              colorscale = "Viridis", 
                              showscale = TRUE)) 

plot <- plot %>%
  layout(title = "Interactive Bubble Chart of Company Revenue",
         xaxis = list(title = "", showticklabels = FALSE),  # Hide x-axis labels
         yaxis = list(title = "Total Revenue"),
         hovermode = "closest")
plot

Irregular Patterns by Revenue

Extracting year from full date as using the full date format is too detail for analysis.

edges <- edges %>%
  mutate(
    Start_Year = year(start_date),
    End_Year = year(end_date)
  )

nodes <- nodes %>%
  mutate(added_year = year(date_added))
The id with top 8 revenues had been extracted and visualise using heat map.
Click to show code
edges <- edges %>%
  mutate(
    Start_Year = year(start_date),
    End_Year = year(end_date)
  )

nodes <- nodes %>%
  mutate(added_year = year(date_added))

top_8_revenue_nodes <- nodes %>%
  arrange(desc(revenue)) %>%
  slice_head(n = 8)

MC2_node_abnor <- edges %>%
  filter(id %in% top_8_revenue_nodes$id) %>%
  group_by(id, Start_Year) %>%
  summarise(weight = n(), .groups = "drop")

MC2_node_abno <- edges %>%
  filter(id %in% top_8_revenue_nodes$id) %>%
  group_by(id, Start_Year) %>%
  summarise(weight = n(), .groups = "drop")

g1 <- ggplot(MC2_node_abnor, aes(Start_Year, id)) +
  geom_tile(aes(fill = weight)) +
  geom_text(aes(label = weight), size = 3) +
  labs(title = "Irregular Pattern by Revenue") +
  scale_fill_gradient(low = "white", high = "lightblue") +
  theme(axis.text.x = element_text(angle = 0, hjust = 1, size = 8)) +
  theme(axis.text.y = element_text(size = 6))

ggarrange(g1, g2, nrow = 2)

print(g1)

# Load necessary libraries
library(dplyr)
library(igraph)
library(visNetwork)

# Step 1: Ensure mc3_nodes and mc3_edges are loaded
# Assuming mc3_nodes and mc3_edges are already loaded

# Step 2: Preprocess mc3_nodes and mc3_edges

# Preprocess mc3_nodes to include id and type_3
mc3_nodes2 <- mc3_nodes %>%
  mutate(id = as.character(id),
         type_3 = as.character(type_3))

# Preprocess mc3_edges to include source, target, and type_new
mc3_edges2 <- mc3_edges %>%
  mutate(source = as.character(source),
         target = as.character(target),
         type_new = as.character(type_new))

# Extract edges to include relevant source and target, and type_new
edges2 <- mc3_edges2 %>%
  select(from = source, to = target, type_new)

# Extract distinct nodes from source and target
all_nodes <- unique(c(edges2$from, edges2$to))
nodes2 <- mc3_nodes2 %>%
  filter(id %in% all_nodes) %>%
  select(id, type_3) %>%
  distinct() %>%
  rename(name = id, group = type_3)

# Ensure there are no missing nodes
missing_nodes <- setdiff(all_nodes, nodes2$name)
if (length(missing_nodes) > 0) {
  additional_nodes <- data.frame(name = missing_nodes, group = "Unknown")
  nodes2 <- bind_rows(nodes2, additional_nodes)
}

# Remove duplicates
nodes2 <- nodes2 %>%
  distinct(name, .keep_all = TRUE)

# Print nodes and edges for inspection
print("Nodes dataframe:")
[1] "Nodes dataframe:"
print(head(nodes2))
# A tibble: 6 × 2
  name                        group  
  <chr>                       <chr>  
1 Abbott, Mcbride and Edwards Company
2 Abbott-Gomez                Company
3 Abbott-Harrison             Company
4 Abbott-Ibarra               Company
5 Abbott-Sullivan             Company
6 Acevedo and Sons            Company
print("Edges dataframe:")
[1] "Edges dataframe:"
print(head(edges2))
# A tibble: 6 × 3
  from          to                            type_new       
  <chr>         <chr>                         <chr>          
1 Avery Inc     Allen, Nichols and Thompson   Shareholdership
2 Berger-Hayes  Jensen, Morris and Downs      Shareholdership
3 Bowers Group  Barnett Inc                   Shareholdership
4 Bowman-Howe   Bennett Ltd                   Shareholdership
5 Boyd and Sons Armstrong, Bennett and Deleon Shareholdership
6 Brown-Johnson Alexander-Wagner              Shareholdership
# Create the graph object
graph <- graph_from_data_frame(d = edges2, vertices = nodes2, directed = TRUE)

# Convert the graph to data frames for visNetwork
nodes_vis <- data.frame(id = V(graph)$name, label = V(graph)$name, group = V(graph)$group)
edges_vis <- igraph::as_data_frame(graph, what = "edges")

# Print nodes and edges for visNetwork for inspection
print("Nodes for visNetwork:")
[1] "Nodes for visNetwork:"
print(head(nodes_vis))
                           id                       label   group
1 Abbott, Mcbride and Edwards Abbott, Mcbride and Edwards Company
2                Abbott-Gomez                Abbott-Gomez Company
3             Abbott-Harrison             Abbott-Harrison Company
4               Abbott-Ibarra               Abbott-Ibarra Company
5             Abbott-Sullivan             Abbott-Sullivan Company
6            Acevedo and Sons            Acevedo and Sons Company
print("Edges for visNetwork:")
[1] "Edges for visNetwork:"
print(head(edges_vis))
           from                            to        type_new
1     Avery Inc   Allen, Nichols and Thompson Shareholdership
2  Berger-Hayes      Jensen, Morris and Downs Shareholdership
3  Bowers Group                   Barnett Inc Shareholdership
4   Bowman-Howe                   Bennett Ltd Shareholdership
5 Boyd and Sons Armstrong, Bennett and Deleon Shareholdership
6 Brown-Johnson              Alexander-Wagner Shareholdership
# Create and customize the visNetwork graph
vis1 <- visNetwork(
  nodes_vis, 
  edges_vis, 
  width = "100%", 
  main = list(
    text = "Separate Clusters Without Overlapping Ownership",
    style = "font-size:17px; font-weight:bold; text-align:right;"
  )
) %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visGroups(groupname = "Unknown", color = "#D3D3D3") %>%
  visGroups(groupname = "Company", color = "#1696d2") %>%
  visLegend() %>%
  visEdges(arrows = "to", color = list(color = "lightgray", highlight = "red")) %>%
  visOptions(
    highlightNearest = list(enabled = TRUE, degree = 2, hover = TRUE),
    nodesIdSelection = TRUE,
    selectedBy = "group",
    collapse = TRUE
  ) %>%
  visInteraction(navigationButtons = TRUE) %>%
  visPhysics(stabilization = FALSE, enabled = FALSE)

# Add additional groups based on edge types
unique_types <- unique(edges2$type_new)
for (t in unique_types) {
  vis1 <- vis1 %>%
    visGroups(groupname = t, color = scales::hue_pal()(length(unique_types))[which(unique_types == t)])
}

# Display the graph
vis1
# Load necessary libraries
library(dplyr)
library(ggraph)
library(igraph)
library(lubridate)
library(ggplot2)
library(ggrepel)


# Extract data with both start_date and end_date
filtered_edges <- edges %>%
  filter(!is.na(start_date) & !is.na(end_date))

# Add year for faceting and coloring
filtered_edges <- filtered_edges %>%
  mutate(year = year(start_date))

# Step 2: Identify the top 30 entities for each type_new based on their frequency
top_entities <- filtered_edges %>%
  group_by(type_new) %>%
  count(id) %>%
  top_n(10, wt = n) %>%
  ungroup() %>%
  pull(id)

# Filter the dataset to include only the top 30 entities of each type_new
filtered_edges_top <- filtered_edges %>%
  filter(id %in% top_entities)

# Create a unique nodes dataframe
nodes <- filtered_edges_top %>%
  select(id, type_new) %>%
  distinct() %>%
  rename(name = id)

# Ensure all nodes in the edge list are in the nodes data frame
all_nodes <- unique(c(filtered_edges_top$id, filtered_edges_top$type_new))
missing_nodes <- setdiff(all_nodes, nodes$name)

if (length(missing_nodes) > 0) {
  additional_nodes <- data.frame(name = missing_nodes, type_new = NA)
  nodes <- bind_rows(nodes, additional_nodes)
}

# Remove duplicates
nodes <- nodes %>%
  distinct(name, .keep_all = TRUE)

# Define custom shapes and colors for each type_new
node_shapes <- c(
  "Fishing-related Companies" = 22,
  "Unknown" = 21,
  "Non-fishing Companies" = 23,
  "Beneficial Owner" = 24,
  "Ultimate Beneficial Owner" = 25,
  "Multi-role Entity" = 20
)

node_colors <- c(
  "Fishing-related Companies" = "blue",
  "Unknown" = "grey",
  "Non-fishing Companies" = "green",
  "Beneficial Owner" = "yellow",
  "Ultimate Beneficial Owner" = "red",
  "Multi-role Entity" = "purple"
)

# Create the graph
graph <- graph_from_data_frame(d = filtered_edges_top, vertices = nodes, directed = TRUE)

# Compute degree centrality
V(graph)$degree <- degree(graph, mode = "all")

# Create a layout
layout <- create_layout(graph, layout = "fr")

# Visualize the graph using ggraph
g <- ggraph(layout) + 
  geom_edge_link(aes(color = factor(year)), alpha = 0.5, edge_width = 0.8) +
  geom_node_point(aes(x = x, y = y, size = degree, shape = type_new, fill = type_new), color = "black", stroke = 0.5) +
  scale_shape_manual(values = node_shapes) +
  scale_fill_manual(values = node_colors) +
  scale_color_manual(values = rainbow(length(unique(filtered_edges_top$year)))) +  # Assign colors to years
  geom_text_repel(aes(x = x, y = y, label = name), size = 2.5) +
  theme_void() +
  theme(legend.position = "right") +
  labs(title = "Network Visualization of Top 30 Entities by Type and Year",
       color = "Year",
       fill = "Type")

# Plot the graph
print(g)

# Load necessary libraries
library(dplyr)
library(ggplot2)
library(ggrepel)

# Extract data with both start_date and end_date
filtered_edges <- edges %>%
  filter(!is.na(start_date) & !is.na(end_date))

# Add year for aggregation and anomaly detection
filtered_edges <- filtered_edges %>%
  mutate(year = year(start_date))

# Calculate the number of activities per `id` and `type_new` for each year
activity_counts <- filtered_edges %>%
  group_by(year, id, type_new) %>%
  summarise(activity_count = n()) %>%
  ungroup()

# Calculate the IQR and identify outliers
activity_counts <- activity_counts %>%
  group_by(type_new) %>%
  mutate(
    Q1 = quantile(activity_count, 0.25),
    Q3 = quantile(activity_count, 0.75),
    IQR = Q3 - Q1,
    lower_bound = Q1 - 1.5 * IQR,
    upper_bound = Q3 + 1.5 * IQR,
    is_outlier = activity_count < lower_bound | activity_count > upper_bound
  ) %>%
  ungroup()

# Plot activity counts and highlight outliers
ggplot(activity_counts, aes(x = year, y = activity_count, color = is_outlier)) +
  geom_point() +
  geom_text_repel(aes(label = ifelse(is_outlier, id, "")), size = 2.5) +
  facet_wrap(~ type_new) +
  labs(title = "Activity Counts per Year with Outliers Highlighted",
       x = "Year", y = "Activity Count", color = "Outlier") +
  theme_minimal()

# Load necessary libraries
library(dplyr)
library(ggplot2)
library(ggrepel)
library(lubridate)

# Assuming edges is already loaded

# Extract data with both start_date and end_date
filtered_edges <- edges %>%
  filter(!is.na(start_date) & !is.na(end_date))

# Add year for aggregation and anomaly detection
filtered_edges <- filtered_edges %>%
  mutate(year = year(start_date))
# Calculate the number of activities per `id` and `type_new` for each year
activity_counts <- filtered_edges %>%
  group_by(year, id, type_new) %>%
  summarise(activity_count = n()) %>%
  ungroup()

# Calculate the IQR and identify outliers
activity_counts <- activity_counts %>%
  group_by(type_new) %>%
  mutate(
    Q1 = quantile(activity_count, 0.25),
    Q3 = quantile(activity_count, 0.75),
    IQR = Q3 - Q1,
    lower_bound = Q1 - 1.5 * IQR,
    upper_bound = Q3 + 1.5 * IQR,
    is_outlier = activity_count < lower_bound | activity_count > upper_bound
  ) %>%
  ungroup()
# Plot activity counts and highlight outliers
ggplot(activity_counts, aes(x = year, y = activity_count, color = is_outlier)) +
  geom_point() +
  geom_text_repel(aes(label = ifelse(is_outlier, id, "")), size = 2.5) +
  facet_wrap(~ type_new, scales = "free_y") +  # Using free_y scale for better visualization
  labs(title = "Activity Counts per Year with Outliers Highlighted",
       x = "Year", y = "Activity Count", color = "Outlier") +
  scale_color_manual(values = c("FALSE" = "blue", "TRUE" = "red")) +  # Color coding outliers
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(size = 14, face = "bold")
  )

# Summarize outliers
outliers_summary <- activity_counts %>%
  filter(is_outlier) %>%
  select(year, id, type_new, activity_count) %>%
  arrange(type_new, year, desc(activity_count))

print(outliers_summary)
# A tibble: 31 × 4
    year id              type_new            activity_count
   <dbl> <chr>           <chr>                        <int>
 1  2014 Wright LLC      BeneficialOwnership              2
 2  2015 Wright LLC      BeneficialOwnership              2
 3  2017 Nichols-Esparza BeneficialOwnership              2
 4  2022 Wright LLC      BeneficialOwnership              2
 5  2023 Nichols-Esparza BeneficialOwnership              2
 6  2027 Smith-Ramirez   BeneficialOwnership              4
 7  2029 Smith-Ramirez   BeneficialOwnership              2
 8  2030 Smith-Ramirez   BeneficialOwnership              2
 9  2031 Smith-Ramirez   BeneficialOwnership              5
10  2032 Stephens-Lopez  BeneficialOwnership              7
# ℹ 21 more rows
# Create a heatmap for outliers
ggplot(outliers_summary, aes(x = year, y = id, fill = activity_count)) +
  geom_tile(color = "white") +
  scale_fill_gradient(low = "lightblue", high = "red") +
  facet_wrap(~ type_new, scales = "free_y") +
  labs(title = "Heatmap of Outliers in Activity Counts",
       x = "Year", y = "Entity", fill = "Activity Count") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(size = 14, face = "bold")
  )